PROJECT DESCRIPTION

Objective:
To categorize countries using the socio-economic and health factors that determine each country's overall development.

Problem Statement:
HELP International has raised around $10 million. The CEO of the NGO now needs to decide how to use this money strategically and effectively, which means choosing the countries that are in the direst need of aid. Your job as a data scientist is to categorize the countries using socio-economic and health factors that determine their overall development, and then to suggest the countries the CEO should focus on most.

Context:
HELP International is an international humanitarian NGO committed to fighting poverty and to providing people in underdeveloped countries with basic amenities and relief during disasters and natural calamities.


DATA PREPARATION

Step 0 — Import Libraries

e1071, tidyverse, plotly, htmltools, devtools, caret, NbClust, reshape2, rvest, magrittr, stringr, cowplot, ggmap

Step 1 — Load Data

DATA DICTIONARY
* country: Name of the country
* child_mort: Deaths of children under 5 years of age per 1000 live births
* exports: Exports of goods and services per capita, given as a percentage of GDP per capita
* health: Total health spending per capita, given as a percentage of GDP per capita
* imports: Imports of goods and services per capita, given as a percentage of GDP per capita
* income: Net income per person
* inflation: The measurement of the annual growth rate of the total GDP
* life_expec: The average number of years a newborn child would live if current mortality patterns remain the same
* total_fer: The number of children that would be born to each woman if current age-fertility rates remain the same
* gdpp: The GDP per capita, calculated as the total GDP divided by the total population
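
As a minimal sketch of the loading step, the snippet below writes six sample rows (values copied from the output in this document, restricted to a few columns) to a temporary CSV and reads them back. The real file path is not given here, so `csv_path` is a hypothetical stand-in, and base R `read.csv` is used as an assumption in place of whichever tidyverse reader produced the tibble output.

```r
# Six sample rows copied from the output shown in this document
# (a subset of columns, to keep the sketch short).
sample_rows <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola",
                 "Antigua and Barbuda", "Argentina"),
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Hypothetical stand-in for the real CSV file, whose path is not given.
csv_path <- tempfile(fileext = ".csv")
write.csv(sample_rows, csv_path, row.names = FALSE)

# Read the file back, keeping `country` as a character column.
country_data <- read.csv(csv_path, stringsAsFactors = FALSE)

head(country_data)  # first rows
dim(country_data)   # number of rows and columns
```

With the full data set, `dim()` would be expected to report the 167 rows and 10 columns listed under Data Dimensions.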

Peek at the First Six Rows

## # A tibble: 6 × 10
##   country             child_mort exports health imports income inflation life_expec
##   <chr>                    <dbl>   <dbl>  <dbl>   <dbl>  <dbl>     <dbl>      <dbl>
## 1 Afghanistan               90.2    10     7.58    44.9   1610      9.44       56.2
## 2 Albania                   16.6    28     6.55    48.6   9930      4.49       76.3
## 3 Algeria                   27.3    38.4   4.17    31.4  12900     16.1        76.5
## 4 Angola                   119      62.3   2.85    42.9   5900     22.4        60.1
## 5 Antigua and Barbuda       10.3    45.5   6.03    58.9  19100      1.44       76.8
## 6 Argentina                 14.5    18.9   8.1     16    18700     20.9        75.8
## # … with 2 more variables: total_fer <dbl>, gdpp <dbl>

Data Dimensions

## Shape: 167 10
## Columns: country child_mort exports health imports income inflation life_expec total_fer gdpp
## Country Labels: 'Afghanistan', 'Albania', 'Algeria', 'Angola', 'Antigua and Barbuda' ...

Step 2 — Check for Missing Values

## Total Missing Values: 0
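
One way such a check could be written is sketched below in base R. The data frame holds a few sample rows copied from this document, and `with_gap` is a hypothetical copy with one value blanked out, added only to show what a non-zero count looks like.

```r
# Sample rows copied from this document (subset of columns).
country_data <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola"),
  child_mort = c(90.2, 16.6, 27.3, 119),
  gdpp       = c(553, 4090, 4460, 3530)
)

# Total number of NA cells in the whole table.
total_missing <- sum(is.na(country_data))

# NA count per column, useful when the total is not zero.
missing_by_column <- colSums(is.na(country_data))

cat("Total Missing Values:", total_missing, "\n")
print(missing_by_column)

# Hypothetical copy with one missing value, for contrast.
with_gap <- country_data
with_gap$gdpp[2] <- NA
cat("Total Missing Values (with gap):", sum(is.na(with_gap)), "\n")
```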

Step 3 — Ensure Correct Data Types

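The country label column is set aside here (the structure below lists 9 columns rather than 10), and no chr variables remain to convert to factor.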

## tibble [167 × 9] (S3: tbl_df/tbl/data.frame)
##  $ child_mort: num [1:167] 90.2 16.6 27.3 119 10.3 14.5 18.1 4.8 4.3 39.2 ...
##  $ exports   : num [1:167] 10 28 38.4 62.3 45.5 18.9 20.8 19.8 51.3 54.3 ...
##  $ health    : num [1:167] 7.58 6.55 4.17 2.85 6.03 8.1 4.4 8.73 11 5.88 ...
##  $ imports   : num [1:167] 44.9 48.6 31.4 42.9 58.9 16 45.3 20.9 47.8 20.7 ...
##  $ income    : num [1:167] 1610 9930 12900 5900 19100 18700 6700 41400 43200 16000 ...
##  $ inflation : num [1:167] 9.44 4.49 16.1 22.4 1.44 20.9 7.77 1.16 0.873 13.8 ...
##  $ life_expec: num [1:167] 56.2 76.3 76.5 60.1 76.8 75.8 73.3 82 80.5 69.1 ...
##  $ total_fer : num [1:167] 5.82 1.65 2.89 6.16 2.13 2.37 1.69 1.93 1.44 1.92 ...
##  $ gdpp      : num [1:167] 553 4090 4460 3530 12200 10300 3220 51900 46900 5840 ...
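
The type check above might look something like the following sketch. It assumes the country names are kept in a separate vector and dropped from the modeling table, which is consistent with the 9-column structure shown but is not confirmed by the source; `country_labels` and `features` are illustrative names.

```r
# Sample rows copied from this document (subset of columns).
country_data <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola"),
  child_mort = c(90.2, 16.6, 27.3, 119),
  income     = c(1610, 9930, 12900, 5900),
  gdpp       = c(553, 4090, 4460, 3530),
  stringsAsFactors = FALSE
)

# Keep the labels aside and drop them from the modeling table (assumption).
country_labels <- country_data$country
features <- country_data[, setdiff(names(country_data), "country"), drop = FALSE]

# Find any remaining character columns and convert them to factors.
chr_cols <- names(features)[vapply(features, is.character, logical(1))]
for (col in chr_cols) {
  features[[col]] <- factor(features[[col]])
}

if (length(chr_cols) == 0) {
  cat("No chr variables to convert to factor\n")
}
str(features)
```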

Step 4 — Explore Data Distributions

Scatter Matrix

Histogram Matrix

Correlation Matrix
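
The plot for this section is not reproduced here. As a rough sketch, a Pearson correlation matrix can be computed with base R `cor()`; the six rows below are copied from this document, so the coefficients describe only that small sample, not the full 167-country table.

```r
# Six sample rows copied from this document (subset of columns).
features <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Pairwise Pearson correlations between all numeric columns.
cor_matrix <- cor(features)

# Round for display only; keep the full-precision matrix for further use.
print(round(cor_matrix, 2))
```

Base R also offers `pairs()` and `hist()` for the scatter and histogram views, although the original figures may well have been drawn with a different plotting library.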

Step 5 — Normalize the Data

## # A tibble: 6 × 9
##   child_mort exports health imports  income inflation life_expec total_fer
##        <dbl>   <dbl>  <dbl>   <dbl>   <dbl>     <dbl>      <dbl>     <dbl>
## 1     0.426   0.0495 0.359   0.258  0.00805    0.126       0.475    0.737 
## 2     0.0682  0.140  0.295   0.279  0.0749     0.0804      0.872    0.0789
## 3     0.120   0.192  0.147   0.180  0.0988     0.188       0.876    0.274 
## 4     0.567   0.311  0.0646  0.246  0.0425     0.246       0.552    0.790 
## 5     0.0375  0.227  0.262   0.338  0.149      0.0522      0.882    0.155 
## 6     0.0579  0.0940 0.391   0.0916 0.145      0.232       0.862    0.192 
## # … with 1 more variable: gdpp <dbl>
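
The normalized values above all fall between 0 and 1, which is consistent with min-max scaling; the source does not name the method, so that is an assumption. Under that assumption, a sketch with a hypothetical helper `min_max` could look like this:

```r
# Six sample rows copied from this document (subset of columns).
features <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Hypothetical helper: rescale a numeric vector to the range [0, 1].
min_max <- function(x) {
  (x - min(x)) / (max(x) - min(x))
}

# Apply the helper to every column and rebuild the data frame.
normalized <- as.data.frame(lapply(features, min_max))

print(round(normalized, 3))
```

Because the minimum and maximum here come from six rows rather than all 167 countries, the numbers will not match the normalized table above.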

K-MEANS CLUSTERING

Step 6 — Initial K-Means Model

country_kmeans <- kmeans(
  countries,
  centers=2,
  algorithm="Lloyd",
  iter.max=30
)

Evaluate Cluster Quality

## Variance Explained: 0.393483
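
The source does not show how "Variance Explained" is computed. A common choice, assumed here, is the ratio of between-cluster sum of squares to total sum of squares from the fitted `kmeans` object. The sketch uses six sample rows from this document with min-max scaling, and adds `set.seed` and `nstart` for repeatability; both are additions not present in the original call.

```r
# Six sample rows copied from this document (subset of columns).
sample_rows <- data.frame(
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Min-max scale each column (assumed normalization).
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
countries_scaled <- as.data.frame(lapply(sample_rows, min_max))

set.seed(42)  # fixed seed so the random starts are repeatable
fit <- kmeans(
  countries_scaled,
  centers = 2,
  algorithm = "Lloyd",
  iter.max = 30,
  nstart = 25  # several random starts; keeps the best solution
)

# Share of total variance captured by the separation between clusters.
variance_explained <- fit$betweenss / fit$totss
cat("Variance Explained:", round(variance_explained, 6), "\n")
```

A ratio closer to 1 means the clusters account for more of the spread in the data, which is why the value is expected to rise as the number of clusters increases.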

Step 7 — Visualize Clusters

Load Map Data

##        long      lat group order region subregion
## 1 -69.89912 12.45200     1     1  Aruba      <NA>
## 2 -69.89571 12.42300     1     2  Aruba      <NA>
## 3 -69.94219 12.43853     1     3  Aruba      <NA>
## 4 -70.00415 12.50049     1     4  Aruba      <NA>
## 5 -70.06612 12.54697     1     5  Aruba      <NA>
## 6 -70.05088 12.59707     1     6  Aruba      <NA>

Visualize Socio-Economic Clusters

Step 8 — Hyperparameter Tuning

Elbow Method
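
The elbow plot is not reproduced here. As an illustrative sketch, the elbow method fits k-means for a range of k and records the total within-cluster sum of squares for each; the "elbow" is where further increases in k stop reducing it by much. The built-in `USArrests` data set is used purely as a stand-in for the normalized country table, and the range 1 to 8 is an arbitrary choice.

```r
# Stand-in data: built-in USArrests, min-max scaled.
# This is NOT the country data; it only makes the sketch runnable.
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
stand_in <- as.data.frame(lapply(USArrests, min_max))

set.seed(42)
k_values <- 1:8

# Total within-cluster sum of squares for each candidate k.
wss <- vapply(k_values, function(k) {
  kmeans(
    stand_in,
    centers = k,
    algorithm = "Lloyd",
    iter.max = 30,
    nstart = 25
  )$tot.withinss
}, numeric(1))

elbow <- data.frame(k = k_values, wss = round(wss, 3))
print(elbow)

# Optional plot of the curve:
# plot(k_values, wss, type = "b", xlab = "k", ylab = "Total within-cluster SS")
```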

NbClust Method

Step 9 — Final Model (k=3)

final_kmeans <- kmeans(
  countries,
  centers=3,
  algorithm="Lloyd",
  iter.max=30
)

Evaluate Cluster Quality

## Variance Explained: 0.547986
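
To turn cluster labels into a recommendation, one possible approach is to profile each cluster by its mean indicators and single out the cluster that looks worst off. The sketch below does this on six sample rows from this document. The rule "highest mean child mortality marks the neediest cluster" is an assumption made for illustration, not the author's stated criterion, and `profile`, `neediest`, and `priority` are hypothetical names.

```r
# Six sample rows copied from this document (subset of columns).
sample_rows <- data.frame(
  country    = c("Afghanistan", "Albania", "Algeria", "Angola",
                 "Antigua and Barbuda", "Argentina"),
  child_mort = c(90.2, 16.6, 27.3, 119, 10.3, 14.5),
  income     = c(1610, 9930, 12900, 5900, 19100, 18700),
  life_expec = c(56.2, 76.3, 76.5, 60.1, 76.8, 75.8),
  gdpp       = c(553, 4090, 4460, 3530, 12200, 10300)
)

# Min-max scale the numeric columns (assumed normalization).
min_max <- function(x) (x - min(x)) / (max(x) - min(x))
scaled <- as.data.frame(lapply(sample_rows[, -1], min_max))

set.seed(42)
fit <- kmeans(scaled, centers = 3, algorithm = "Lloyd",
              iter.max = 30, nstart = 25)
sample_rows$cluster <- fit$cluster

# Mean of each raw indicator per cluster.
profile <- aggregate(
  cbind(child_mort, income, life_expec, gdpp) ~ cluster,
  data = sample_rows,
  FUN = mean
)
print(profile)

# Assumption: the cluster with the highest mean child mortality is neediest.
neediest <- profile$cluster[which.max(profile$child_mort)]
priority <- sample_rows$country[sample_rows$cluster == neediest]
print(priority)
```

On the full data set, a single indicator is a crude proxy; combining several (for example low income and low life expectancy alongside high child mortality) would likely give a more defensible shortlist.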

Visualize Socio-Economic Clusters

Step 10 — Visualize 3D Clusters

CONCLUSION